AITopics | relational self-attention

Relational Self-Attention: What's Missing in Attention for Video Understanding Supplementary Material

Neural Information Processing Systems | Apr-25-2026, 15:32:23 GMT

We use TSN-ResNet [11] as our backbone (see Table 1) and initialize it with ImageNet-pretrained weights [4]. We replace 7 of its spatial convolutional layers with RSA layers: in every second ResNet block, from the third block in res2 to the second block in res5, the spatial convolutional layer is replaced with an RSA layer. For the bottlenecks that include RSA layers, we randomly initialize the weights using MSRA initialization [3] and set the gamma parameter of the last batch normalization layer to zero. We resize each frame to 240 × 320 and apply random cropping to 224 × 224, scale jittering, and random horizontal flipping for data augmentation. Note that we do not flip videos whose action labels include the words 'left' or 'right', e.g., 'pulling something from left to right'.
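Two details of this setup can be illustrated with a minimal Python sketch. It assumes a ResNet-50 stage layout (3, 4, 6, and 3 bottleneck blocks in res2 through res5), which the text does not state explicitly, and a flip probability of 0.5, which is also an assumption. The helper names `rsa_block_positions`, `is_flip_safe`, and `maybe_flip` are hypothetical and are not taken from the authors' code.

```python
import random
import re

# ASSUMPTION: ResNet-50 layout, i.e. bottleneck blocks per stage.
STAGE_BLOCKS = [("res2", 3), ("res3", 4), ("res4", 6), ("res5", 3)]

# Words that make an action label sensitive to horizontal mirroring.
DIRECTION_WORDS = {"left", "right"}


def rsa_block_positions(start=("res2", 3), end=("res5", 2), stride=2):
    """List the (stage, block) pairs whose spatial conv would become an RSA layer.

    Blocks are numbered from 1 within each stage; every `stride`-th block
    between `start` and `end` (inclusive) is selected.
    """
    blocks = [(stage, i) for stage, n in STAGE_BLOCKS for i in range(1, n + 1)]
    first, last = blocks.index(start), blocks.index(end)
    return blocks[first:last + 1:stride]


def is_flip_safe(label):
    """Return True if the label contains neither 'left' nor 'right' as a word."""
    words = re.findall(r"[a-z]+", label.lower())
    return not any(word in DIRECTION_WORDS for word in words)


def maybe_flip(frames, label, p=0.5, rng=random):
    """Mirror all frames of a clip with probability p, unless the label forbids it.

    `frames` is a list of frames; each frame is a list of rows of pixels.
    The same decision is applied to every frame so the clip stays consistent.
    ASSUMPTION: p = 0.5 is a common default, not a value given in the text.
    """
    if is_flip_safe(label) and rng.random() < p:
        return [[row[::-1] for row in frame] for frame in frames]
    return frames


if __name__ == "__main__":
    for stage, block in rsa_block_positions():
        print(stage, "block", block)
    print(is_flip_safe("pulling something from left to right"))
```

Under the assumed layout, `rsa_block_positions()` selects seven blocks, which is consistent with the seven replaced layers described above. A different backbone depth would need a different `STAGE_BLOCKS` table.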

artificial intelligence, machine learning, video understanding

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.55)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.40)

Relational Self-Attention: What's Missing in Attention for Video Understanding

Neural Information Processing Systems | Dec-24-2025, 01:08:41 GMT

Convolution has arguably been the most important feature transform for modern neural networks, driving the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms its convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.

feature transform, name change, relational self-attention

Technology:

Information Technology > Artificial Intelligence > Vision (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.61)
